{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对预训练的模型进行微调\n",
    "===\n",
    "使用预训练的模型有很大的好处。它降低了计算成本，减少了你的碳足迹，并允许你使用最先进的模型，而不必从头开始训练一个。Transformers为广泛的任务提供了数以千计的预训练模型的访问。当你使用一个预训练的模型时，你在一个特定于你的任务的数据集上训练它。这被称为微调，是一种极其强大的训练技术。在本教程中，你将用你选择的深度学习框架对一个预训练的模型进行微调：\n",
    "* 用Transformers训练器对预训练的模型进行微调。\n",
    "* 在TensorFlow中用Keras微调一个预训练的模型。\n",
    "* 在本地PyTorch中对预训练的模型进行微调。\n",
    "\n",
    "## 准备一个数据集\n",
    "\n",
    "在对预训练的模型进行微调之前，先下载一个数据集，为训练做准备。前面的教程向你展示了如何为训练而处理数据，现在你有机会检验这些技能了!\n",
    "\n",
    "首先，加载Yelp评论数据集："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "dataset = load_dataset(\"yelp_review_full\")\n",
    "dataset[\"train\"][100]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "{'label': 0,\n",
    " 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\\\nThe cashier took my friends\\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\\\\"serving off their orders\\\\\" when they didn\\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\\\nThe manager was rude when giving me my order. She didn\\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\\\nI\\'ve eaten at various McDonalds restaurants for over 30 years. I\\'ve worked at more than one location. I expect bad days, bad moods, and the occasional mistake. But I have yet to have a decent experience at this store. It will remain a place I avoid unless someone in my party needs to avoid illness from low blood sugar. Perhaps I should go back to the racially biased service of Steak n Shake instead!'}\n",
    "```\n",
    "\n",
    "正如你现在知道的，你需要一个标记器来处理文本，并包括一个填充和截断策略来处理任何可变的序列长度。要想一步到位地处理你的数据集，可以使用数据集映射方法在整个数据集上应用一个预处理函数："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "from transformers import AutoTokenizer\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"bert-base-cased\")\n",
    "\n",
    "\n",
    "def tokenize_function(examples):\n",
    "    return tokenizer(examples[\"text\"], padding=\"max_length\", truncation=True)\n",
    "\n",
    "\n",
    "tokenized_datasets = dataset.map(tokenize_function, batched=True)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果你愿意，你可以创建一个完整数据集的较小子集来进行微调，以减少所需时间："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "small_train_dataset = tokenized_datasets[\"train\"].shuffle(seed=42).select(range(1000))\n",
    "small_eval_dataset = tokenized_datasets[\"test\"].shuffle(seed=42).select(range(1000))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 训练\n",
    "\n",
    "在这一点上，你应该遵循与你想使用的框架相对应的部分。你可以使用右侧边栏的链接跳到你想要的那一个--如果你想隐藏某个框架的所有内容，只需使用该框架区块右上方的按钮就可以了!（这里选择Pytorch）\n",
    "\n",
    "## 使用PyTorch进行训练\n",
    "\n",
    "Transformers提供了一个为训练transformers模型而优化的训练器类，使其更容易开始训练，而无需手动编写自己的训练循环。训练器API支持广泛的训练选项和功能，如记录、梯度累积和混合精度。\n",
    "\n",
    "首先加载你的模型，并指定预期标签的数量。从Yelp Review的数据集卡中，你知道有五个标签："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "from transformers import AutoModelForSequenceClassification\n",
    "\n",
    "model = AutoModelForSequenceClassification.from_pretrained(\"bert-base-cased\", num_labels=5)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> 你会看到一个警告：一些预训练的权重没有被使用，一些权重被随机初始化。不要担心，这完全是正常的!BERT模型的预训练头被丢弃，并被随机初始化的分类头所取代。你将在你的序列分类任务上对这个新模型进行微调，将预训练模型的知识转移到它身上。\n",
    "\n",
    "### 训练超参数\n",
    "\n",
    "接下来，创建一个TrainingArguments类，其中包含所有你可以调整的超参数，以及激活不同训练选项的标志。在本教程中，你可以从默认的训练超参数开始，但可以自由地对这些参数进行试验，以找到你的最佳设置。\n",
    "\n",
    "指定保存训练中检查点的位置："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "from transformers import TrainingArguments\n",
    "\n",
    "training_args = TrainingArguments(output_dir=\"test_trainer\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 评估\n",
    "\n",
    "训练器在训练期间不会自动评估模型的性能。你需要传给Trainer一个函数来计算和报告指标。Evaluate库提供了一个简单的精度函数，你可以用evaluate.load（更多信息见本快速指南）函数加载："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import evaluate\n",
    "\n",
    "metric = evaluate.load(\"accuracy\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "调用metric的compute来计算你预测的准确性。在把你的预测传递给计算之前，你需要把预测转换成对数（记住所有Transformers模型都返回对数）："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "def compute_metrics(eval_pred):\n",
    "    logits, labels = eval_pred\n",
    "    predictions = np.argmax(logits, axis=-1)\n",
    "    return metric.compute(predictions=predictions, references=labels)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果你想在微调期间监测你的评估指标，在你的训练参数中指定evaluation_strategy参数，以便在每个epoch结束时报告评估指标："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "from transformers import TrainingArguments, Trainer\n",
    "\n",
    "training_args = TrainingArguments(output_dir=\"test_trainer\", evaluation_strategy=\"epoch\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 训练\n",
    "\n",
    "用你的模型、训练参数、训练和测试数据集以及评估函数创建一个Trainer对象："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "trainer = Trainer(\n",
    "    model=model,\n",
    "    args=training_args,\n",
    "    train_dataset=small_train_dataset,\n",
    "    eval_dataset=small_eval_dataset,\n",
    "    compute_metrics=compute_metrics,\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "然后通过调用train()来微调你的模型："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "trainer.train()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 在原生PyTorch中训练\n",
    "\n",
    "Trainer负责处理训练循环，并允许你在一行代码中对模型进行微调。对于喜欢自己写训练循环的用户，你也可以在原生的PyTorch中微调一个Transformers模型。\n",
    "\n",
    "在这一点上，你可能需要重新启动你的笔记本或执行以下代码以释放一些内存："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "del model\n",
    "del trainer\n",
    "torch.cuda.empty_cache()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接下来，手动对tokenized_dataset进行后处理，为训练做准备。\n",
    "\n",
    "1. 删除文本栏，因为该模型不接受原始文本作为输入："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "tokenized_datasets = tokenized_datasets.remove_columns([\"text\"])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. 将标签列重命名为labels，因为模型期望该参数被命名为labels："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "tokenized_datasets = tokenized_datasets.rename_column(\"label\", \"labels\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. 设置数据集的格式以返回PyTorch张量而不是列表："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "tokenized_datasets.set_format(\"torch\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "然后如前所述创建一个较小的数据集子集，以加快微调的速度："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "small_train_dataset = tokenized_datasets[\"train\"].shuffle(seed=42).select(range(1000))\n",
    "small_eval_dataset = tokenized_datasets[\"test\"].shuffle(seed=42).select(range(1000))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 数据加载器\n",
    "\n",
    "为你的训练和测试数据集创建一个DataLoader，这样你就可以对成批的数据进行迭代："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "from torch.utils.data import DataLoader\n",
    "\n",
    "train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)\n",
    "eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "用预期的标签数量加载你的模型："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "from transformers import AutoModelForSequenceClassification\n",
    "\n",
    "model = AutoModelForSequenceClassification.from_pretrained(\"bert-base-cased\", num_labels=5)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 优化器和学习率调度器\n",
    "\n",
    "创建一个优化器和学习率调度器来微调模型。让我们使用PyTorch的AdamW优化器："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "from torch.optim import AdamW\n",
    "\n",
    "optimizer = AdamW(model.parameters(), lr=5e-5)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "从Trainer中创建默认的学习率调度器："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "from transformers import get_scheduler\n",
    "\n",
    "num_epochs = 3\n",
    "num_training_steps = num_epochs * len(train_dataloader)\n",
    "lr_scheduler = get_scheduler(\n",
    "    name=\"linear\", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "最后，如果你有机会使用GPU，请指定设备使用GPU。否则，在CPU上的训练可能需要几个小时，而不是几分钟的时间。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "device = torch.device(\"cuda\") if torch.cuda.is_available() else torch.device(\"cpu\")\n",
    "model.to(device)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> 如果你没有像Colaboratory或SageMaker StudioLab这样的托管笔记本，可以免费使用云GPU。\n",
    "\n",
    "很好，现在你已经准备好训练了！\n",
    "\n",
    "### 训练循环\n",
    "\n",
    "为了跟踪你的训练进度，使用tqdm库在训练步骤的数量上添加一个进度条："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "from tqdm.auto import tqdm\n",
    "\n",
    "progress_bar = tqdm(range(num_training_steps))\n",
    "\n",
    "model.train()\n",
    "for epoch in range(num_epochs):\n",
    "    for batch in train_dataloader:\n",
    "        batch = {k: v.to(device) for k, v in batch.items()}\n",
    "        outputs = model(**batch)\n",
    "        loss = outputs.loss\n",
    "        loss.backward()\n",
    "\n",
    "        optimizer.step()\n",
    "        lr_scheduler.step()\n",
    "        optimizer.zero_grad()\n",
    "        progress_bar.update(1)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 评估\n",
    "\n",
    "就像你给Trainer添加评估函数一样，你在编写自己的训练循环时也需要这样做。但不是在每个历时结束时计算和报告指标，这次你将用add_batch累积所有的批次，并在最后计算指标。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "import evaluate\n",
    "\n",
    "metric = evaluate.load(\"accuracy\")\n",
    "model.eval()\n",
    "for batch in eval_dataloader:\n",
    "    batch = {k: v.to(device) for k, v in batch.items()}\n",
    "    with torch.no_grad():\n",
    "        outputs = model(**batch)\n",
    "\n",
    "    logits = outputs.logits\n",
    "    predictions = torch.argmax(logits, dim=-1)\n",
    "    metric.add_batch(predictions=predictions, references=batch[\"labels\"])\n",
    "\n",
    "metric.compute()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 额外资源\n",
    "\n",
    "更多微调的例子，请参考：\n",
    "\n",
    "* Transformers Examples包括在PyTorch和TensorFlow中训练常见NLP任务的脚本。\n",
    "* Transformers Notebooks包含各种关于如何为PyTorch和TensorFlow中的特定任务微调模型的笔记本。\n",
    "\n",
    "---\n",
    "V4.30.0"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
